Statistical thinking — descriptive statistics
2024-09-19
What is the length of flippers in male and female Adelie penguins?
Describe properties of the data using summary or descriptive statistics
Two main types of descriptive statistic: measures of location and measures of spread
Measures of location or central tendency describe where the majority of the data are found, e.g. the mean and the median
How variable the data are about this location is described by measures of spread
The mean, or average, is the sum of the observations, divided by the number of observations
\[\overline{y} = \frac{1}{n}\sum\limits^n_{i=1}y_i\]
Flipper lengths for male penguins
195 + 201 + 197 + 186 + 190 + 195 + 189 + 196 + 190 + 195 + 190 + 198 + 185 + 189 + 188 = 2884
2884 / 15 = 192.27
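The arithmetic above can be checked with a short Python snippet, using the male flipper lengths listed in the previous slide:

```python
# Male Adelie flipper lengths (mm), as listed above
male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

total = sum(male)          # sum of the observations: 2884
mean = total / len(male)   # divide by the number of observations, n = 15
print(total, round(mean, 2))  # 2884 192.27
```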
The mean is not a robust measure of central tendency. It is heavily influenced by extreme observations
195, 201, 197, 186, 190, 195, 189, 196, 190, 195, 190, 198, 185, 189, 526.4
Replaced a value of 188 with a score of 526.4
Mean of modified data set is 214.83, larger than all observations except the modified value
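The pull exerted by a single extreme value is easy to verify; here 188 is replaced with 526.4, as in the example above:

```python
# Same male flipper lengths, but with 188 replaced by 526.4
modified = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
            190, 198, 185, 189, 526.4]

mean_modified = sum(modified) / len(modified)
print(round(mean_modified, 2))  # 214.83 -- larger than every unmodified value
```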
A robust measure of central tendency should be less affected by extreme observations. The median is one such measure
The median is the value of a set of observations that has equal numbers of observations above and below it
If \(n\) is odd, median is the middle value of the ordered observations
If \(n\) is even, median is the midpoint between the \(n / 2\)th and \((n / 2) + 1\)th observations
Middle two observations are 187, 190
Median is (187 + 190) / 2 = 188.5
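Both cases are handled by `statistics.median` in the Python standard library, which returns the middle value when \(n\) is odd and the midpoint of the two middle values when \(n\) is even:

```python
import statistics

# Male Adelie flipper lengths (mm); n = 15, so n is odd
male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

med_male = statistics.median(male)       # odd n: middle ordered value
med_even = statistics.median([187, 190]) # even n: midpoint of middle pair
print(med_male, med_even)  # 190 188.5
```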
Measures of spread describe the dispersion, or variability, of our observations
The range is a simple measure of spread and is the difference between the minimum and maximum observed values
For the male Adelie penguins 185, 186, 188, 189, 189, 190, 190, 190, 195, 195, 195, 196, 197, 198, 201
Range is \(201 - 185 = 16\)
For the female Adelie penguins
174, 176, 178, 185, 186, 187, 189, 190, 190, 193, 193, 195, 196, 199, 202
Range is \(202 - 174 = 28\)
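Both ranges can be computed from the raw observations with the built-in `min` and `max`:

```python
# Flipper lengths (mm) for the two groups, as listed above
male = [185, 186, 188, 189, 189, 190, 190, 190, 195, 195,
        195, 196, 197, 198, 201]
female = [174, 176, 178, 185, 186, 187, 189, 190, 190, 193,
          193, 195, 196, 199, 202]

range_male = max(male) - min(male)        # 201 - 185
range_female = max(female) - min(female)  # 202 - 174
print(range_male, range_female)  # 16 28
```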
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable
Percentiles are a type of quantile — 90th percentile score means 90% of sampled scores are lower than this percentile
The median is a quantile — it is the 50th percentile
Quartiles break ordered sample into 4 regions
Deciles break the ordered sample into 10 regions
Values of the sample quantiles don’t depend on the mean or median of the sample
Range doesn’t tell us much about how variable observations are between the extremes & is heavily influenced by extreme values
Better measure would trim off some proportion of the smallest & largest observations before computing the range
The IQR is the difference between the upper (75th percentile) and lower (25th percentile) quartiles of the data
185, 186, 188, 189, 189, 190, 190, 190, 195, 195, 195, 196, 197, 198, 201
Lower & upper quartiles are: 189, 195.5
IQR is: 6.5
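One way to reproduce these values is `statistics.quantiles` with `method='inclusive'`, which interpolates between the order statistics; note that other quantile methods can give slightly different answers for small samples:

```python
import statistics

male = [185, 186, 188, 189, 189, 190, 190, 190, 195, 195,
        195, 196, 197, 198, 201]

# n=4 returns the three quartile cut points (25th, 50th, 75th percentiles)
q1, q2, q3 = statistics.quantiles(male, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 189.0 195.5 6.5
```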
For a random variable \(y\), the variance \(\sigma^2(y)\) is a measure of how far the observations differ from the expected value (the mean)
How should we measure deviations from the mean? The sum of squares is a logical place to start
\[ \mathrm{SS}_y = \sum_{i = 1}^n (y_i - \bar{y})^2 \]
The mean of the sum of squares is known as the mean square; it uses the average squared deviation from the sample mean as an estimate of the population variance
\[ \hat{\sigma}_y^2 = \frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})^2 = \frac{\displaystyle \sum_{i = 1}^n (y_i - \bar{y})^2}{n} \]
Problem: mean square is a biased estimator of \(\sigma^2\)
Leads to the concept of degrees of freedom — the number of independent pieces of information that we can use to estimate statistical parameters
For an unbiased estimate of \(\sigma^2\), divide sums of squares by \(n-1\) not \(n\)
Why \(n - 1\)? Usual answer: We have used 1 parameter to estimate the mean; we have \(n - 1\) independent pieces of information left
Sample variance is defined as
\[ \hat{\sigma}_y^2 = \frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar{y})^2 \]
Variance is measured in squared units relative to \(y\)
Taking the square root of \(\hat{\sigma}^2\) gives the standard deviation, which is measured on the same scale as \(y\)
\[ \hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar{y})^2} \]
Now \(\hat{\sigma}\) is measured in the same units as \(y\) — easier to understand
Note we need at least two observations to compute \(\hat{\sigma}^2\) and \(\hat{\sigma}\)
Mean male flipper length: 192.27
Deviations from mean:
2.73, 8.73, 4.73, -6.27, -2.27, 2.73, -3.27, 3.73, -2.27, 2.73, -2.27, 5.73, -7.27, -3.27, -4.27
Squared deviations from mean:
7.47, 76.27, 22.4, 39.27, 5.14, 7.47, 10.67, 13.94, 5.14, 7.47, 5.14, 32.87, 52.8, 10.67, 18.2
Sum of squares: 314.93
Variance: \(\hat{\sigma}^2 = \frac{314.93}{15 - 1}\) = 22.5
Standard deviation: \(\hat{\sigma} = \sqrt{\frac{314.93}{15 - 1}}\) = 4.74
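The whole calculation is a few lines in Python; `statistics.variance` and `statistics.stdev` both divide by \(n - 1\), matching the sample formulas above:

```python
import statistics

male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

ybar = statistics.mean(male)
ss = sum((y - ybar) ** 2 for y in male)  # sum of squared deviations
var = statistics.variance(male)          # ss / (n - 1)
sd = statistics.stdev(male)              # sqrt of the variance
print(round(ss, 2), round(var, 1), round(sd, 2))  # 314.93 22.5 4.74
```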
A robust alternative to \(\sigma^2\) and \(\sigma\) is the mean absolute deviation (MAD)
Rather than using squared differences, MAD uses absolute deviations from the sample mean
\[\text{MAD} = \frac{1}{n} \sum_{i=1}^n \left | y_i - \bar{y} \right |\]
Mean male flipper length: 192.27
Deviations from mean:
2.73, 8.73, 4.73, -6.27, -2.27, 2.73, -3.27, 3.73, -2.27, 2.73, -2.27, 5.73, -7.27, -3.27, -4.27
Absolute deviations from mean:
2.73, 8.73, 4.73, 6.27, 2.27, 2.73, 3.27, 3.73, 2.27, 2.73, 2.27, 5.73, 7.27, 3.27, 4.27
Sum of absolute deviations: 62.27
MAD: \(\text{MAD} = \frac{62.27}{15}\) = 4.15
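The MAD as defined above (the *mean* absolute deviation about the mean; some software uses MAD for the median absolute deviation instead) is straightforward to compute directly:

```python
male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

ybar = sum(male) / len(male)
# mean absolute deviation: average of |y_i - ybar|, dividing by n
mad = sum(abs(y - ybar) for y in male) / len(male)
print(round(mad, 2))  # 4.15
```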
Difficult to compare standard deviations of two samples measured in different units, or with very different means; \(\hat{\sigma}\) scales with the magnitude of the observations
The coefficient of variation is the standard deviation divided by the sample mean. Often multiplied by 100 to express result as a percentage
\[ \text{CV} = \left ( \frac{\hat{\sigma}}{\bar{y}} \right ) \times 100 = \left (\frac{\displaystyle \sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2}}{\displaystyle \frac{1}{n}\sum\limits^n_{i=1}y_i} \right ) \times 100 \]
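For the male flipper lengths the CV is small, because the standard deviation (4.74) is tiny relative to the mean (192.27):

```python
import statistics

male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

# coefficient of variation, expressed as a percentage
cv = statistics.stdev(male) / statistics.mean(male) * 100
print(round(cv, 2))  # 2.47
```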
Two other descriptive statistics you might encounter are skewness & kurtosis
They describe how the shape of the sample distribution departs from that of the Gaussian distribution
Skewness describes how far the sample differs from a symmetrical distribution
Kurtosis describes how probability density is distributed in the tails of a distribution
Rarely mentioned in the literature but some software will churn these out alongside other descriptive statistics
Their statistical properties are not very good; they are sensitive to outliers & to differences in the means of the distributions being compared
A plot will suffice
The standard deviation & the standard error of the mean are often confused
The standard deviation (\(\hat{\sigma}\)) is a measure of the deviation of observations about the mean
The standard error (of the mean) is a measure of how variable or uncertain the estimate (\(\bar{y}\)) of the population mean (\(\mu\)) is
\[\hat{\sigma}_{\overline{y}} = \frac{\hat{\sigma}}{\sqrt{n}}\]
A large standard error, relative to the size of the mean, would indicate lots of variability in the means we would observe if we took a large number of samples (of the same size) from the population
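For the male flipper lengths, the standard error of the mean follows directly from the standard deviation and the sample size:

```python
import math
import statistics

male = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
        190, 198, 185, 189, 188]

# standard error of the mean: sigma-hat / sqrt(n)
sem = statistics.stdev(male) / math.sqrt(len(male))
print(round(sem, 2))  # 1.22
```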